Abstract: In this paper, we derive a probabilistic registration
algorithm for object modeling and tracking. In many robotics 
applications, such as manipulation tasks, nonvisual information 
about the movement of the object is available, which we will 
combine with the visual information. Furthermore, we do not
only consider observations of the object, but we also take
into account the space which has been observed to not be part
of the object. Finally, we compute a posterior
distribution over the relative alignment rather than a point estimate,
as is typically done in, for example, Iterative Closest Point (ICP).
To our knowledge no existing algorithm meets these three 
conditions and we thus derive a novel registration algorithm in 
a Bayesian framework. Experimental results suggest that the 
proposed methods perform favorably in comparison to PCL 
[1] implementations of feature mapping and ICP, especially if 
nonvisual information is available. 

I. INTRODUCTION

In this paper we will focus on the scenario where the 
camera is fixed and only the object is manipulated. While the 
object is being moved, a 3D camera gathers depth images 
of the object in different orientations and positions. Let us 
denote two such images as image A and image B. The core 
problem considered in this paper is to estimate the rigid 
body transformation T the object has undergone between 
the acquisitions of these two images. Segmentation is not
the focus of this work; we employ existing algorithms [1] to
determine whether a pixel in the depth image belongs to the 
object or to the background. 

A great deal of work has been done in this research 
area in the past years. In [2] an algorithm is presented 
which creates 3D models of objects while the camera or 
the object is moved. However, the point clouds have to be
approximately aligned initially and the model is created
offline by optimizing the alignment of all images
simultaneously. Our method is more general in the sense that point
clouds do not have to be approximately aligned. However,
task-specific assumptions of this kind can be introduced to
significantly reduce the computation time for finding the
optimal alignment.

A lot of very promising work, such as [3], [4], has been
published in recent years on scanning objects while they
are being held by the robot. We, however, want to treat a
more general case where we do not assume that the object
is already grasped or can be grasped in a straightforward
manner.

In [5] models are constructed by mapping shape primitives 
to the point clouds with promising results. In this work 
however we try to make as few assumptions as possible about 
the shape of the object and thus exclude the use of models 
or shape primitives. 

Among the most popular algorithms that tackle the
registration problem are Iterative Closest Point (ICP), feature
mapping algorithms, and combinations of both [6], [7], [8],
[9], [10]. We will compare the proposed method with these
two approaches.

ICP has been proven to converge to a local minimum [11]. In
the scenario considered in this paper, an object can move very
fast and therefore the point clouds of two subsequent images
are not necessarily approximately aligned. This problem
is usually tackled by initially aligning the point clouds using
a feature mapping algorithm [6], [8], [9]. These methods
perform well if different parts of the object can easily
be distinguished. For objects with a homogeneous texture,
color or local shape, feature matching can be problematic.
Furthermore, if the quality of the features degrades with the
quality of the point cloud, noisy data can cause problems.

Often in robotics a great deal of nonvisual information
about the transformation of the object is available. In
our scenario, this information can for example be that an 
object is pushed on a table and the movement will therefore 
be in a plane. If it is held by a robot, we approximately 
know how the object will move. This kind of information 
can certainly be incorporated in ICP and feature mapping 
algorithms, but they are not originally designed to do so. 

ICP and feature mapping algorithms commonly optimize a 
cost function that is only dependent on the relative alignment 
between two point clouds. In our proposed method, we 
take into account the space which has been observed to 
not contain any part of the object. Our results suggest 
that taking this information into account leads to more 
robust registration results. Introducing visibility constraints 
has previously been shown to help in estimating the occluded 
shape of an unknown object [12]. 

Finally, feature mapping and ICP algorithms usually return
a point estimate of the transformation and a fitness value. It can
however be preferable to have a more differentiated estimate
of the transformation in the form of a probability distribution
over the six parameters of the transformation. This allows us
to express, for example, that we are certain about the rotation
around the x axis but uncertain about the translation in y. In
the results section we will show an example of the use of a
probability distribution as the result.

To our knowledge, there is no registration algorithm that 
combines the three mentioned points: 

1) Cost function based on visibility constraints. 

2) Output of a posterior distribution over the estimated 
object pose change. 

3) Straightforward incorporation of task-relevant nonvisual
information.

In the next section, we will derive the proposed registration 
algorithm in a Bayesian framework. In the result section, we 
show that under certain conditions that are quite common 
in the scenario of object model learning and tracking, our 
algorithm outperforms implementations of ICP and feature 
mapping methods. 

II. DERIVATION 

A. Incorporated Information 

An overview of all the information we will make use of 
can be seen in Fig. 1. The input data D consists of the visual 
information V and the nonvisual information N. 
Fig. 1. Overview of the variables.


1) Nonvisual Information: In the context of a robotic 
manipulation task often a great deal of nonvisual information 
about the movement of an object is available. N can contain 
for example the information that the object will be moved 
on a table, that the robot has poked it with a certain 
movement or that the object is being held by the robot, and 
we thus know how it has moved approximately. 

2) Visual Information: We divide the visual information into two types (see Fig. 2). Firstly, there are surface patches which are observed by the depth camera, from now on referred to as patches P. These patches can be represented as a point cloud and are the only information used by ICP and by most feature mapping algorithms.

There is however another very important piece of information. No part of the object is inside the green area in Fig. 2; this area thus defines a mask M for the object.

Fig. 2. Two types of visual information: surface patches P and mask M.

We will always register two depth images, A and B, at a time. We therefore have the masks, $M_A$ and $M_B$, as well as the patches, $P_A$ and $P_B$, from each image (see Fig. 1).

B. Parametrization 

1) Coordinate system: Given that we will work with depth images, we choose a suitable parametrization assuming the pinhole model for the camera. The first two parameters, w and h, are chosen to be the projections of a 3D point onto a virtual image plane given a focal length of 1 m, see Fig. 3. The third parameter r is the depth of the 3D point. These coordinates will be called ray coordinates. They are derived from Cartesian coordinates as follows:

$$w = \frac{x}{z}, \quad h = \frac{y}{z}, \quad r = \sqrt{x^2 + y^2 + z^2} \qquad (1)$$

Fig. 3. Schematic representation of eight pixels acquired by the depth camera. The coordinates r and w are represented, while h would be perpendicular to the image plane.
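The conversion in Eq. 1 and its inverse can be sketched in a few lines of Python. This is only an illustration, assuming that r is the Euclidean distance of the point from the camera centre and that z points along the optical axis; the function names `cartesian_to_ray` and `ray_to_cartesian` are hypothetical and do not come from the original work.

```python
import math

def cartesian_to_ray(x, y, z):
    """Map a Cartesian point (camera frame, z along the optical axis) to (w, h, r)."""
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    w = x / z                                # projection onto the plane at focal length 1
    h = y / z
    r = math.sqrt(x * x + y * y + z * z)     # assumed: distance from the camera centre
    return w, h, r

def ray_to_cartesian(w, h, r):
    """Inverse map: recover (x, y, z) from ray coordinates."""
    z = r / math.sqrt(w * w + h * h + 1.0)   # from r^2 = z^2 (w^2 + h^2 + 1)
    return w * z, h * z, z

# Round trip for one example point.
w, h, r = cartesian_to_ray(0.3, -0.2, 1.5)
x, y, z = ray_to_cartesian(w, h, r)
```

The round trip returns the original point up to floating-point error, which is a quick sanity check on the two maps.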

2) Rigid body transformation: The rigid body transformation T has six independent parameters, $T = (T_1, \ldots, T_6)^\top$. The parametrization can be chosen to be whatever is convenient for a given application.

C. Measurement Error 

Due to measurement errors in the camera, an observed 3D point p will not exactly correspond to the true point s on the object surface. As a measurement model $p(p \mid s)$ we use a normal distribution in ray coordinates:

$$p(p \mid s) = \mathcal{N}(s \mid p, \Sigma) \qquad (2)$$

The covariance matrix $\Sigma$ is camera specific. The only assumption we make in our derivation is that the covariance matrix is such that $p(p \mid s)$ can be reasonably well approximated as being constant within a pixel. This assumption is sensible because the depth camera is not able to distinguish between points within the range of one pixel. Furthermore, assuming that $p(p)$ and $p(s)$ are uniform in the range of the depth camera, we have $p(s \mid p) = p(p \mid s)$.

D. Derivation 

Our objective is to express $p(T \mid D)$, the probability distribution over the transformation T the object has undergone, given all the available data D. Applying Bayes' rule we have

$$p(T \mid D) = p(T \mid N, P, M) \qquad (3)$$

$$= \frac{p(M \mid N, P, T)\, p(T \mid N, P)}{\int p(M \mid N, P, T)\, p(T \mid N, P)\, dT} \qquad (4)$$

In $p(M \mid N, P, T)$, given the transformation T, the mask M does not depend on the nonvisual information N and we thus have

$$p(M \mid N, P, T) = p(M \mid P, T) \qquad (5)$$

$$= p(M_A, M_B \mid P_A, P_B, T) \qquad (6)$$

We assume $M_A$ and $M_B$ to be independent because the mask observed in one image does not give us any useful information about the mask observed in the other image:

$$p(M \mid N, P, T) = p(M_A \mid P_A, P_B, T)\, p(M_B \mid P_A, P_B, T) \qquad (7)$$

As $M_A$ and $P_A$ are from the same image, the patches $P_A$ will necessarily be respected in the mask $M_A$. $P_A$ thus does not add any information to the first term and can be removed. Similarly, for the second term we can omit $P_B$:

$$p(M \mid N, P, T) = p(M_A \mid P_B, T)\, p(M_B \mid P_A, T) \qquad (8)$$

It is reasonable to assume that the priors $p(M \mid T)$ and $p(P \mid T)$ are uniform because we do not have any prior information about the distribution of the points and the mask. Applying Bayes' rule, we thus have

$$p(M \mid N, P, T) = k\, p(P_B \mid M_A, T)\, p(P_A \mid M_B, T) \qquad (9)$$

with k being a constant.


Inserting this result into Eq. 4 we obtain

$$p(T \mid D) = \frac{p(P_B \mid M_A, T)\, p(P_A \mid M_B, T)\, p(T \mid N, P)}{\int p(P_B \mid M_A, T)\, p(P_A \mid M_B, T)\, p(T \mid N, P)\, dT}$$

Finding this distribution is intractable, but for most purposes we do not need the distribution itself; we only use it for evaluating expectations. We thus need to find the expectation of a function $f(T)$ expressing a property of T required for a given application. If f is, for example, the identity ($f(T) = T$), then $E(f(T)) = E(T)$, or if $f(T) = (T - E(T))(T - E(T))^\top$, then $E(f(T))$ is the covariance matrix. The expectation of a function of T is

$$E(f(T)) = \int f(T)\, \frac{p(P_B \mid M_A, T)\, p(P_A \mid M_B, T)\, p(T \mid N, P)}{\int p(P_B \mid M_A, T)\, p(P_A \mid M_B, T)\, p(T \mid N, P)\, dT}\, dT \qquad (10)$$

Approximating this integral by sampling, we obtain

$$E(f(T)) \approx \sum_{l=1}^{L} w^{(l)} f(T^{(l)}) \qquad (11)$$

where the samples $T^{(l)}$ are drawn from $p(T \mid N, P)$. The sampling weights are defined by

$$w^{(l)} = \frac{p(P_B \mid M_A, T^{(l)})\, p(P_A \mid M_B, T^{(l)})}{\sum_{k=1}^{L} p(P_B \mid M_A, T^{(k)})\, p(P_A \mid M_B, T^{(k)})} \qquad (12)$$

We thus have represented $p(T \mid D)$ by a set of samples $T^{(l)}$ and the corresponding weights $w^{(l)}$. The samples are drawn from $p(T \mid N, P)$. In other words, we will create a distribution from which it is possible to sample, taking into account the nonvisual information as well as the observed surface patches. This distribution will be defined independently for a given application; an example is discussed in the results section.
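The weighted-sample estimate of Eqs. 11 and 12 can be illustrated with a one-dimensional toy problem. The sketch below is not the implementation used in the experiments: the standard normal proposal stands in for $p(T \mid N, P)$, a Gaussian bump around 1 stands in for the product of the two mask terms, and the name `weighted_moments` is hypothetical.

```python
import math
import random

def weighted_moments(samples, likelihoods):
    """Self-normalised estimate of mean and variance from samples and unnormalised weights."""
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("all samples have zero likelihood")
    weights = [v / total for v in likelihoods]              # normalisation as in Eq. 12
    mean = sum(w * t for w, t in zip(weights, samples))     # Eq. 11 with f(T) = T
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, samples))
    return mean, var

rng = random.Random(0)
# Toy stand-ins (assumptions): proposal N(0, 1) and a unit-variance Gaussian likelihood at 1.
samples = [rng.gauss(0.0, 1.0) for _ in range(50000)]
likelihoods = [math.exp(-0.5 * (t - 1.0) ** 2) for t in samples]
mean, var = weighted_moments(samples, likelihoods)
```

For this toy choice the exact posterior is a normal distribution with mean 0.5 and variance 0.5, so the two estimates can be checked against known values.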

The terms $p(P_A \mid M_B, T)$ and $p(P_B \mid M_A, T)$ determine the weight of a given sample. The first one expresses the likelihood of T given the patches observed in A and the mask observed in B. It essentially states that the transformation T has to be such that the patches observed in A fit into the mask observed in B. Conversely, the second term ensures that the patches from B fit into the mask from A.

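Now we will express $p(P_B \mid M_A, T)$. $P_A$ is the set of all the surface patches observed in image A and can be expressed as a set of points $\{a_1, a_2, \ldots, a_n\}$. Similarly we have $P_B = \{b_1, b_2, \ldots, b_m\}$. We can now write

$$p(P_B \mid M_A, T) = p(b_1, b_2, \ldots, b_m \mid M_A, T) \qquad (13)$$

Given the mask $M_A$ observed in image A, we look at the points $P_B$ observed in image B as independent observations:

$$p(P_B \mid M_A, T) = \prod_{j=1}^{m} p(b_j \mid M_A, T) \qquad (14)$$

After the derivation in Appx. A we have

$$p(b_j \mid M_A, T) \propto K_2 \sum_{i=1}^{n} e^{-\frac{1}{2} ([b_j]_A - a_i)_{w,h}^\top D\, ([b_j]_A - a_i)_{w,h}} \left(1 + \operatorname{erf}\!\left(v^\top ([b_j]_A - a_i)\right)\right) \qquad (15)$$

with $D$, $v$, $K_2$, $\Lambda$ defined in Appx. A. Here $[b_j]_A$ denotes the point $b_j$ expressed in the ray coordinates of image A, and the subscript $w,h$ selects the first two components. The second term in Eq. 12, $p(P_A \mid M_B, T)$, can be expressed analogously.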

E. Discussion 

The first term in Eq. 15 is a Gaussian over the parameters w, h with mean $(w_i, h_i)^\top$. This term accounts for the fact that the closer $[b_j]_A$ is to a pixel i, the likelier it is that the point which has been observed at $b_j$ in image B is observed in pixel i in image A. The second term goes to zero if the depth of $[b_j]_A$ is smaller than the depth at the pixel onto which it is projected in image A, which is necessary in order to respect the mask $M_A$.

Given that $p(P_B \mid M_A, T)$ (see Eq. 14) is the product of all $p(b_j \mid M_A, T)$, it is zero if any $p(b_j \mid M_A, T)$ is zero. This result is illustrated in Fig. 4: all of the red points have to be inside the blue area.

Fig. 4. Schematic representation of $p(b_j \mid M_A, T)$ (blue), $p(a_i \mid M_B, T)$ (red), $\{a_1, a_2, \ldots, a_n\}$ (blue dots) and $\{b_1, b_2, \ldots, b_m\}$ (red dots).
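The effect of the mask can be illustrated with a deliberately simplified test. The sketch below replaces the smooth erf term of Eq. 15 by a hard threshold and the Gaussian over neighbouring pixels by a nearest-pixel lookup; it is therefore a coarse approximation, not the likelihood derived here. The function name, the dictionary-based depth image and the `pixel_pitch` and `tolerance` parameters are all hypothetical.

```python
def respects_mask(point_in_a, depth_a, pixel_pitch, tolerance):
    """Return True if a point, given in ray coordinates (w, h, r) of image A,
    does not lie in space that image A has observed to be free.

    depth_a maps integer pixel indices (col, row) to the measured range;
    pixels without a measurement are simply absent from the dict.
    """
    w, h, r = point_in_a
    pixel = (round(w / pixel_pitch), round(h / pixel_pitch))   # nearest pixel (assumption)
    measured = depth_a.get(pixel)
    if measured is None:
        return True                      # nothing measured along this ray: no constraint
    return r >= measured - tolerance     # closer than the measurement means observed free space

# Hypothetical two-pixel depth image with a pitch of 0.002 in ray coordinates.
depth_a = {(0, 0): 1.0, (1, 0): 0.8}
behind_surface = respects_mask((0.0, 0.0, 1.2), depth_a, 0.002, 0.006)
in_free_space = respects_mask((0.0, 0.0, 0.7), depth_a, 0.002, 0.006)
unobserved_ray = respects_mask((0.01, 0.01, 0.1), depth_a, 0.002, 0.006)
```

A transformation under which any projected point fails this test would receive zero weight, which mirrors the statement that the product in Eq. 14 vanishes as soon as one factor does.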









III. Implementation 

The only parameter that has to be determined for our algorithm is the covariance matrix of the camera uncertainty (Eq. 2). This is however not a parameter that has to be optimized; it represents a meaningful quantity and should be estimated for the depth camera that is used. For our experiments with the Kinect camera we estimated the covariance matrix to be isotropic with $\sigma = 0.002$, which corresponds approximately to the resolution in ray coordinates of the Kinect. This value is a very rough estimate of the properties of the Kinect, but it proves to work well in the experiments.

The core of the algorithm looks as follows:

• For each of the L samples

- Sample $T^{(l)}$ from $p(T \mid N, P)$

- For all points $b_j$ in B

* if $p(b_j \mid M_A, T^{(l)})$ is zero, sample a new transform

* else $p(P_B \mid M_A, T^{(l)})$ *= $p(b_j \mid M_A, T^{(l)})$

- Do the same for points in A

• Given all the $p(P_A \mid M_B, T^{(l)})$ and $p(P_B \mid M_A, T^{(l)})$ we can compute the covariance matrix and the mean of T according to Eq. 11 and Eq. 12.
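The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the sampler and the two per-point likelihoods are passed in as callables, rejected transforms are simply dropped, and the products are accumulated in the log domain to avoid underflow. All names are hypothetical, and the usage example is a one-dimensional toy in which T is a scalar shift.

```python
import math
import random

def weight_samples(sample_transform, points_a, points_b,
                   lik_a_in_mask_b, lik_b_in_mask_a, num_samples):
    """Draw transforms, drop those that violate a mask, and return
    the surviving transforms with their normalised weights (Eq. 12)."""
    transforms, log_liks = [], []
    for _ in range(num_samples):
        t = sample_transform()
        log_lik = 0.0
        rejected = False
        for lik, points in ((lik_b_in_mask_a, points_b),
                            (lik_a_in_mask_b, points_a)):
            for p in points:
                value = lik(p, t)
                if value <= 0.0:             # one violated point zeroes the whole product
                    rejected = True
                    break
                log_lik += math.log(value)   # product of Eq. 14, in the log domain
            if rejected:
                break
        if not rejected:
            transforms.append(t)
            log_liks.append(log_lik)
    if not transforms:
        raise RuntimeError("every sampled transform violated a mask")
    top = max(log_liks)
    unnormalised = [math.exp(v - top) for v in log_liks]   # shift to avoid underflow
    total = sum(unnormalised)
    return transforms, [u / total for u in unnormalised]

# 1-D toy: the points of B are the points of A shifted by 0.5.
rng = random.Random(1)
points_a, points_b = [0.0, 1.0], [0.5, 1.5]

def lik_b(b, t):     # hypothetical stand-in for p(b_j | M_A, T)
    d = min(abs((b - t) - a) for a in points_a)
    return 0.0 if d > 0.3 else math.exp(-0.5 * (d / 0.1) ** 2)

def lik_a(a, t):     # hypothetical stand-in for p(a_i | M_B, T)
    d = min(abs((a + t) - b) for b in points_b)
    return 0.0 if d > 0.3 else math.exp(-0.5 * (d / 0.1) ** 2)

transforms, weights = weight_samples(lambda: rng.uniform(-1.0, 1.0),
                                     points_a, points_b, lik_a, lik_b, 5000)
mean_shift = sum(w * t for w, t in zip(weights, transforms))
```

Because every sample is processed independently until the final normalisation, the outer loop is the natural place to parallelize.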

IV. Results 

As mentioned in the introduction, the algorithms we want
to compare against are ICP and feature mapping. We used the
implementations of these algorithms in the Point Cloud
Library (PCL) for our evaluation. We employed FPFH features,
which are described in [10].

Our dataset consists of three objects: a box, a flashlight and a tube. Our dataset is small; the three objects however vary greatly in shape, as seen in Fig. 5, and therefore this evaluation gives us a reasonable idea about the performance of our algorithm. Admittedly, a broader evaluation will be necessary for a more precise assessment of the performance. In the associated video (http://youtu.be/oWiNbItu2yM) the algorithm is applied to a series of different objects on a tabletop.

Fig. 5. Box, flashlight and tube.

Each of the three objects has been rotated in steps of about 25° and translated by a few cm 14 times on a tabletop. At each step we acquire a depth image and measure the object's exact position and orientation, which will be used as ground truth. For evaluation we align each image to the next, which gives a total of 13 alignments per object.


We compare our algorithm to ICP, feature mapping and feature mapping with subsequent ICP. We use the implementations of these algorithms in the Point Cloud Library (PCL) [1]. The feature mapping algorithm uses Fast Point Feature Histograms (FPFH) as shape features [10]. We used these algorithms to the best of our knowledge and implemented them as suggested in the PCL tutorials. We do not claim that the performance we measure here for ICP and feature mapping is the maximum that can be achieved with these algorithms, but it serves as a good point of comparison for our new algorithm.

A. Evaluation of Alignment Performance without Nonvisual 
Information 

In order to obtain a general estimate of the alignment performance of our algorithm we only make very general assumptions for the sampling distribution $p(T \mid N, P_A, P_B)$. We assume that we have no information N; we thus do not use the information that the object has only been translated and rotated on a tabletop. We only assume that the center of mass of $P_A$ will be no further than 4 cm from the center of mass of $P_B$ in the aligned images and that the object will not be rotated by more than 50 degrees at a time. Note that these assumptions leave a very big search space open. We therefore have to draw a very large number of samples, about 100 million, and the algorithm is thus slow and takes about 30 seconds per image. ICP took about 1 second and feature mapping took about 5 seconds. In practice however we will have much stronger sampling distributions, which will accelerate our algorithm considerably.
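A sampling distribution respecting the two bounds above could be sketched as follows. The text only states the bounds, so the specific choices here are assumptions: the rotation is drawn as a uniformly distributed axis with an angle uniform in [0°, 50°], and the offset between the centers of mass is drawn uniformly from a ball of radius 4 cm. The function name and constants are hypothetical.

```python
import math
import random

MAX_ANGLE = math.radians(50.0)   # rotation bound stated in the text
MAX_SHIFT = 0.04                 # 4 cm bound on the center-of-mass offset, in metres

def sample_pose_change(rng):
    """Return (axis, angle, shift): a unit rotation axis, an angle in [0, MAX_ANGLE]
    and a center-of-mass offset with norm at most MAX_SHIFT."""
    # Normalising an isotropic Gaussian vector gives a uniformly distributed axis.
    while True:
        axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in axis))
        if norm > 1e-12:
            axis = [c / norm for c in axis]
            break
    angle = rng.uniform(0.0, MAX_ANGLE)   # assumption: uniform in the angle itself
    # Rejection sampling from the enclosing cube gives a uniform point in the ball.
    while True:
        shift = [rng.uniform(-MAX_SHIFT, MAX_SHIFT) for _ in range(3)]
        if sum(c * c for c in shift) <= MAX_SHIFT ** 2:
            break
    return axis, angle, shift

rng = random.Random(42)
draws = [sample_pose_change(rng) for _ in range(1000)]
```

Six free parameters with bounds this loose explain why a very large number of samples is needed in this setting.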



Fig. 6. Boxplot of alignment error for different algorithms. 

In Fig. 6 we present the box plots of the alignment error
in degrees of the four algorithms for all the objects. Our
algorithm performs favorably compared to these implementations
of ICP and feature mapping. We will now investigate
how this advantage emerges.

Fig. 7 shows an alignment performed by ICP with an error of 32°. The top image shows the aligned point clouds. The two bottom images represent the information about the mask. The left one illustrates $p([b_j]_A \mid M_A, T)$. In the blue area the object has been observed, in the gray area background has been observed, and in the black area no observation has been made. The red points represent the points observed in B projected into image A, $[b_j]_A$. The result of our derivation suggests that the red points can only be in the blue or black area. If a point $[b_j]_A$ is located on a pixel $a_i$ in the blue area, its distance to the camera r has to be approximately equal to or larger than the depth measured at $a_i$. If the point is located in the black area, its distance to the camera can be arbitrary because no depth has been measured at the corresponding pixel.

Fig. 7. Flashlight aligned by ICP. Alignment error = 32°.

ICP however only uses the information contained in the point clouds, which are quite sparse in the considered images. Looking at the top image of Fig. 7, it is not surprising that ICP performs poorly on this data. If we look at the two bottom images however, we can see that many of the projected points are in front of the background. Taking this information into account we thus know that this alignment is not correct. In Fig. 8 the alignment of the same two images by our algorithm is shown. Even though the point clouds are quite sparse, it has performed well thanks to the information about the mask.

Fig. 8. Flashlight aligned by new algorithm. Alignment error = 2.1°.

Fig. 9 illustrates a problem of a different nature that occurred with feature mapping and ICP. The problem here is that this box, looking only at the point clouds, allows different alignments. The red point cloud should be rotated about 90° to the left. The box fortunately is a little bit longer than it is wide. Taking the mask into account we can thus resolve this ambiguity. In the small image at the bottom right we see that many blue points are in front of the background, which enables our algorithm to discard this alignment. Our algorithm aligned these images with an error of 2.7°.

These results illustrate that taking the mask into account can resolve important problems.

Fig. 9. Box aligned by feature mapping followed by ICP. Alignment error = 86°.

B. Evaluation of Alignment Performance with Nonvisual 
Information 

Now we will show the benefits of taking nonvisual
information N about the transformation into account. The
dataset we have been working on consists of translations
and rotations on a tabletop. Before, we did not make use
of this information. Now we include this information in the
sampling distribution of our algorithm. We thus only sample
from translations and rotations in the plane of the table. Of
course our search space is much smaller now, and therefore
we need fewer samples. The computation time is reduced to
about 0.5 seconds per alignment. The additional information
of course also contributes to the alignment performance, as
we can see in Fig. 11. There may be ways to make use of
nonvisual information in ICP and feature mapping algorithms
as well; it does however not emerge naturally and we did
not try to do so. In Fig. 11 two very sparse point clouds
are shown. Even for a human it is hard to tell how these
should be aligned. ICP, feature mapping and feature mapping
with subsequent ICP all aligned these point clouds with an
error of at least 51°. Our algorithm without the table prior
produced an error of 11°. With the prior however these points are aligned with an error of only 6.3°. This example illustrates that nonvisual information and the information from the mask can be complementary. Even this point cloud of very bad quality has been aligned almost correctly.

Fig. 10. Boxplot of alignment error for different algorithms.

Fig. 11. Very sparse point clouds of flashlight aligned with help of nonvisual information. Alignment error = 6.3°.
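Restricting the sampling distribution to the table plane reduces the six parameters to three. The sketch below shows one way to draw such a transform; it assumes a coordinate frame whose z axis is the table normal and uses uniform bounds chosen purely for illustration. The function name and its parameters are hypothetical.

```python
import math
import random

def sample_planar_transform(rng, max_shift, max_angle):
    """Draw a rigid transform restricted to the table plane, as a 4x4 row-major matrix.

    Assumes a frame whose z axis is the table normal, so the motion is a
    rotation about z plus a translation in x and y.
    """
    yaw = rng.uniform(-max_angle, max_angle)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

rng = random.Random(7)
T = sample_planar_transform(rng, max_shift=0.04, max_angle=math.radians(50.0))
```

With only three parameters to cover, far fewer samples are needed for the same sample density, which is consistent with the reduced computation time reported above.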

C. Evaluation of Alignment Performance with Loop-closure 

We argued that as output of the alignment we prefer a probability distribution to a point estimate. As an example of why this is useful, we merge all the point clouds of the box, aligned by our algorithm with the table prior, into one point cloud, as shown in Fig. 12. Image 1 has been aligned to image 2, image 2 to image 3 and so on. Between the first and the last image the object has been rotated by around 360°; we can thus close the loop and align the last image to the first. Therefore we now have redundant information about the transformation of each image, and can optimize these transformations. This optimization is problematic if we only have a point estimate of each transformation. Fortunately however we can compute the mean as well as the covariance matrix of each transformation. We can thus estimate the probability of each transformation assuming that its distribution is normal with covariance matrix and mean as computed by our algorithm. We numerically maximize the joint probability of all the transformations using the graph optimization algorithm described in [13]. The difference between the transformations which are optimized and the ones which are not is illustrated in Fig. 12. This illustrates one of the advantages of having a posterior distribution rather than a point estimate.

Fig. 12. The transformations on the right have been optimized; those on the left have not.
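Why the covariance matters for loop closure can be seen in a one-dimensional simplification. The sketch below is not the graph optimization of [13]: it treats each alignment as an independent Gaussian estimate of a single rotation angle and enforces that the angles add up to a known total, for which the maximum-likelihood solution has a closed form. The function name and the numbers are hypothetical.

```python
def close_loop(increments, variances, total):
    """Maximum-likelihood increments under independent Gaussian estimates
    and the hard constraint that they sum to `total`."""
    residual = total - sum(increments)
    var_sum = sum(variances)
    if var_sum <= 0.0:
        raise ValueError("at least one increment must be uncertain")
    # Lagrange-multiplier solution: spread the residual in proportion to each variance.
    return [x + v * residual / var_sum for x, v in zip(increments, variances)]

# Three hypothetical yaw steps (degrees) that should add up to a full turn.
measured = [118.0, 121.0, 115.0]
variances = [1.0, 1.0, 4.0]      # the last estimate is the least certain
corrected = close_loop(measured, variances, 360.0)
```

The missing 6° are distributed as 1°, 1° and 4°, so the least certain estimate absorbs most of the correction. With point estimates alone there would be no principled way to decide which alignment to adjust.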

V. Conclusion and future work 

The results of our evaluation are promising, but for a 
full assessment of the performance of our novel algorithm 
many more experiments are necessary. The derivation of this 
algorithm is general and does not assume in any way that 
the object is on top of a table. Note that we only used 
this information where we explicitly mentioned it. The tabletop
setting might however be favorable for the performance of our
algorithm because the depth camera always manages to measure the
depth at pixels which are on the tabletop. This gives us
a lot of information about the mask. The next step will be 
to measure the performance of the algorithm in other cases, 
such as when the object is held by the robot hand. 

The core of our algorithm is sampling; it can thus easily
be parallelized or even implemented on a GPU in order to
reduce the computation time.

In our sampling distribution $p(T \mid N, P_A, P_B)$ we have barely used the information coming from $P_A$, $P_B$. This information is not very important if we already have a good idea of how the object has moved given N. If, however, we have no nonvisual information about how the object has moved, we can make assumptions based on $P_A$, $P_B$. These assumptions are specific to a given application. If we know, for example, that we will observe only objects that are much longer in one dimension than in the others, then we can assume that the first principal component of A is approximately aligned with the first principal component of B. Another possibility is to employ features. If we compute features for each point in A and B, we can create a set of possible matches. Then we can sample from these matches, three at a time, which gives us a sample for T. We might however inherit problems of feature mapping algorithms.
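Three non-collinear point matches determine a rigid transformation. One simple way to compute it, shown below, is to attach an orthonormal frame to each point triple and map one frame onto the other. This construction is an assumption made for illustration and is exact only for noise-free matches; the text does not specify how a sample for T would be obtained from three matches. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(a):
    n = math.sqrt(sum(c * c for c in a))
    if n < 1e-12:
        raise ValueError("degenerate (collinear or repeated) points")
    return [c / n for c in a]

def _frame(p1, p2, p3):
    """Right-handed orthonormal axes (three row vectors) attached to a point triple."""
    e1 = _unit(_sub(p2, p1))
    e3 = _unit(_cross(e1, _sub(p3, p1)))
    e2 = _cross(e3, e1)
    return [e1, e2, e3]

def transform_from_three_matches(pts_a, pts_b):
    """Rotation R (3x3) and translation t with pts_b[k] = R pts_a[k] + t for exact matches."""
    fa, fb = _frame(*pts_a), _frame(*pts_b)
    # R maps each axis of frame A onto the matching axis of frame B.
    R = [[sum(fb[k][i] * fa[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [pts_b[0][i] - sum(R[i][j] * pts_a[0][j] for j in range(3)) for i in range(3)]
    return R, t

# Hypothetical matches: a 90 degree rotation about z followed by a small shift.
pts_a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pts_b = [[0.1, 0.2, 0.3], [0.1, 1.2, 0.3], [-0.9, 0.2, 0.3]]
R, t = transform_from_three_matches(pts_a, pts_b)
```

A wrong match in the triple yields a wrong transform, which is one way the problems of feature mapping would be inherited; in the proposed framework such a sample would simply receive a low weight.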

When we ran the algorithm on the robot we moved its arms 
manually. This was of course only for evaluation, a possible 
application of the algorithm is to be used in the context of 
manipulation tasks. There are numerous possibilities, such 
as using the algorithm in a grasping pipeline. If the robot 
encounters, for example, an object which does not have an 
obvious associated grasp observing it from only one side, we 
can start poking it with actions that minimize the uncertainty 
in the alignment. While the object is being moved around, 
our algorithm tracks it and completes a model. The more 
information we gain, the more likely are we to select the 
correct grasp. 

In summary, we can say that there are many applications 
and possible extensions for this algorithm. Its most important 
feature is that due to its general formulation, it can make use 
of all the information available in a given case. 

References 

[1] R. B. Rusu and S. Cousins, “3D is here: Point Cloud Library (PCL),” in The IEEE International Conference on Robotics and Automation (ICRA), 2011.

[2] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt, “3D Shape Scanning with a Time-of-Flight Camera,” in Proc. IEEE Computer Vision and Pattern Recognition, 2010.

[3] M. Krainin, P. Henry, X. Ren, and D. Fox, “Manipulator and Object Tracking for 3D Object Modeling,” International Journal of Robotics Research, 2011.

[4] W. H. Li and L. Kleeman, “Interactive Learning of Visually Symmetric Objects,” in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2009, pp. 4751-4756.

[5] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Close-range Scene Segmentation and Reconstruction of 3D Point Cloud Maps for Mobile Manipulation in Domestic Environments,” in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2009.

[6] J. Dai and J. Yang, “A Novel Two-Stage Algorithm for Accurate Registration of 3-D Point Clouds,” in Proc. Intl. Conf. on Multimedia Technology (ICMT), 2011, pp. 6187-6191.

[7] Y. Liu and M. A. Rodrigues, “Accurate Registration of Structured Data using Two Overlapping Range Images,” in Proc. IEEE Intl. Conf. on Robotics and Automation, 2002, pp. 2519-2524.

[8] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz, “Aligning Point Cloud Views using Persistent Feature Histograms,” in Proc. IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, 2008.

[9] H. Fukai and G. Xu, “Fast and robust registration of multiple 3d point clouds,” in Intl. Symp. on Robot and Human Interactive Communication (RO-MAN), 2011, pp. 331-336.

[10] R. B. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (fpfh) for 3d registration,” in The IEEE International Conference on Robotics and Automation (ICRA), 2009.

[11] P. J. Besl and N. D. McKay, “A method for registration of 3-d shapes,” IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 14, pp. 239-256, 1992.

[12] J. Bohg, M. Johnson-Roberson, B. Leon, J. Felip, X. Gratal, N. Bergstrom, D. Kragic, and A. Morales, “Mind the Gap - Robotic Grasping under Incomplete Observation,” in IEEE International Conference on Robotics and Automation (ICRA), May 2011.

[13] G. Grisetti, R. Kuemmerle, H. Strasdat, and K. Konolige, “g2o: A general framework for graph optimization,” in The IEEE International Conference on Robotics and Automation (ICRA), 2011.



Appendix 


A. Derivation of $p(b \mid M_A, T)$

As explained in Appx. B, the transformation from ray coordinates in image A to ray coordinates in image B is not linear; it can however be approximated linearly around a point. Point b is expressed in ray coordinates in image B and a is expressed in ray coordinates in image A.

$$p(b \mid M_A, T) = \int p(b \mid a, T)\, p(a \mid M_A)\, da \qquad (16)$$

$p(b \mid a, T)$ expresses the probability distribution over the position of a point b observed in B given that we have observed the same point at a in A. With s being the underlying point, expressed in ray coordinates in image A, we can write

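$$p(b \mid a, T) = \int_{-\infty}^{\infty} p(b \mid s, T)\, p(s \mid a, T)\, ds = \int_{-\infty}^{\infty} \mathcal{N}(b \mid [s]_B, \Sigma)\, \mathcal{N}(s \mid a, \Sigma)\, ds \qquad (17)$$

where the second equality follows from inserting Eq. 2. Given that $[s]_B$ is only relevant in the neighborhood of b, we can replace $[s]_B$ by its linear approximation around b obtained in Appx. B:

$$p(b \mid a, T) = \int \mathcal{N}\!\left(b \mid b + Q_B R Q_A^{-1}(s - [b]_A), \Sigma\right) \mathcal{N}(s \mid a, \Sigma)\, ds \qquad (18)$$

$$p(b \mid a, T) = K_1\, e^{-\frac{1}{2}(a - [b]_A)^\top \Lambda\, (a - [b]_A)} \qquad (19)$$

$$K_1 = \frac{1}{(2\pi)^{3/2} \left| \Sigma + Q_B R Q_A^{-1} \Sigma Q_A^{-\top} R^\top Q_B^\top \right|^{1/2}}, \quad \Lambda^{-1} = Q_A R^{-1} Q_B^{-1} \Sigma Q_B^{-\top} R\, Q_A^\top + \Sigma \qquad (20)$$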


As explained in the assumptions section, the whole term inside the integral of Eq. 16 can be approximated as being constant within the range of a pixel. The integral over w and h thus becomes a sum over the number of pixels n:

$$p(b \mid M_A, T) \propto \sum_{i=1}^{n} \int p(b \mid w_i, h_i, r, T)\, p(w_i, h_i, r \mid M_A)\, dr \qquad (21)$$


$p(w_i, h_i, r \mid M_A)$ is the probability distribution over the observation of b in A, given the mask $M_A$. This probability distribution is equal to zero in the green area of Fig. 2 because we know that no part of the object has been observed there. Everywhere else it is uniform because we have no further information, considering only the mask. The green area, for a pixel $(w_i, h_i)$, corresponds to the range between the camera and the depth measured at that pixel ($r_i$). Therefore the probability distribution is equal to zero for $r < r_i$ and uniform for $r > r_i$. This can easily be translated into limits for the integral, and we obtain

$$p(b \mid M_A, T) \propto \sum_{i=1}^{n} \int_{r_i}^{\infty} p(b \mid w_i, h_i, r, T)\, dr \qquad (22)$$


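We can now integrate and obtain

$$p(b \mid M_A, T) \propto K_2 \sum_{i=1}^{n} e^{-\frac{1}{2}([b]_A - a_i)_{w,h}^\top D\, ([b]_A - a_i)_{w,h}} \left(1 + \operatorname{erf}\!\left(v^\top([b]_A - a_i)\right)\right) \qquad (23)$$

where $a_i = (w_i, h_i, r_i)^\top$, the subscript $w,h$ selects the first two components, and

$$\Lambda^{-1} = Q_A R^{-1} Q_B^{-1} \Sigma Q_B^{-\top} R\, Q_A^\top + \Sigma, \quad K_2 = \frac{1}{4\pi \sqrt{\Lambda_{33}} \left| \Sigma + Q_B R Q_A^{-1} \Sigma Q_A^{-\top} R^\top Q_B^\top \right|^{1/2}} \qquad (24)$$

$$D = \frac{1}{\Lambda_{33}} \begin{bmatrix} \Lambda_{11}\Lambda_{33} - \Lambda_{31}^2 & \Lambda_{33}\Lambda_{21} - \Lambda_{31}\Lambda_{32} \\ \Lambda_{33}\Lambda_{21} - \Lambda_{31}\Lambda_{32} & \Lambda_{22}\Lambda_{33} - \Lambda_{32}^2 \end{bmatrix}, \quad v = \frac{1}{\sqrt{2\Lambda_{33}}} \begin{bmatrix} \Lambda_{31} \\ \Lambda_{32} \\ \Lambda_{33} \end{bmatrix} \qquad (25)$$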

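B. Linear approximation to ray coordinate transformation

We want to linearly approximate the transformation from ray coordinates in image A to ray coordinates in image B around the point b, which is defined in ray coordinates in image B. With s defined in image A, it is straightforward to show that

$$[s]_B \approx b + Q_B R Q_A^{-1}(s - [b]_A) \qquad (26)$$

with

$$Q_A = \begin{bmatrix} \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} & \frac{\partial w}{\partial z} \\ \frac{\partial h}{\partial x} & \frac{\partial h}{\partial y} & \frac{\partial h}{\partial z} \\ \frac{\partial r}{\partial x} & \frac{\partial r}{\partial y} & \frac{\partial r}{\partial z} \end{bmatrix}_{[b]_A} \quad \text{and} \quad Q_B = \begin{bmatrix} \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} & \frac{\partial w}{\partial z} \\ \frac{\partial h}{\partial x} & \frac{\partial h}{\partial y} & \frac{\partial h}{\partial z} \\ \frac{\partial r}{\partial x} & \frac{\partial r}{\partial y} & \frac{\partial r}{\partial z} \end{bmatrix}_{b}$$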
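The linear approximation of Eq. 26 can be verified numerically for a small displacement. The sketch below assumes that R is the rotation part of T, that $Q_A^{-1}$ is evaluated at $[b]_A$ and $Q_B$ at b, and that r is the Euclidean distance from the camera centre. The example rotation, translation and points are hypothetical values chosen for illustration, as are all function names.

```python
import math

def to_ray(p):
    x, y, z = p
    return [x / z, y / z, math.sqrt(x * x + y * y + z * z)]

def to_cart(q):
    w, h, r = q
    z = r / math.sqrt(w * w + h * h + 1.0)
    return [w * z, h * z, z]

def jac_cart_to_ray(p):
    """Q: partial derivatives of (w, h, r) with respect to (x, y, z) at Cartesian point p."""
    x, y, z = p
    r = math.sqrt(x * x + y * y + z * z)
    return [[1.0 / z, 0.0, -x / (z * z)],
            [0.0, 1.0 / z, -y / (z * z)],
            [x / r, y / r, z / r]]

def jac_ray_to_cart(q):
    """Q^{-1}: partial derivatives of (x, y, z) with respect to (w, h, r) at ray point q."""
    w, h, r = q
    n = math.sqrt(w * w + h * h + 1.0)
    dz = [-r * w / n ** 3, -r * h / n ** 3, 1.0 / n]    # gradient of z = r / n
    z = r / n
    return [[z + w * dz[0], w * dz[1], w * dz[2]],
            [h * dz[0], z + h * dz[1], h * dz[2]],
            dz]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def linearised_map(s, b_in_a, b, rot):
    """[s]_B ~ b + Q_B R Q_A^{-1} (s - [b]_A), all points in ray coordinates."""
    delta = [s[i] - b_in_a[i] for i in range(3)]
    step = matvec(jac_ray_to_cart(b_in_a), delta)       # Q_A^{-1}, evaluated at [b]_A
    step = matvec(rot, step)                            # rotation part R of T (assumption)
    step = matvec(jac_cart_to_ray(to_cart(b)), step)    # Q_B, evaluated at b
    return [b[i] + step[i] for i in range(3)]

# Hypothetical transform: rotation of 0.3 rad about the y axis plus a translation.
c, sn = math.cos(0.3), math.sin(0.3)
rot = [[c, 0.0, sn], [0.0, 1.0, 0.0], [-sn, 0.0, c]]
trans = [0.05, -0.02, 0.1]

def exact_map(q):
    """Exact transformation of a ray-coordinate point from image A to image B."""
    p = matvec(rot, to_cart(q))
    return to_ray([p[i] + trans[i] for i in range(3)])

b_in_a = to_ray([0.1, 0.05, 1.0])
b = exact_map(b_in_a)
s = [b_in_a[0] + 1e-4, b_in_a[1] - 1e-4, b_in_a[2] + 2e-4]
exact, approx = exact_map(s), linearised_map(s, b_in_a, b, rot)
```

For a displacement of this size the linearized result differs from the exact one only at second order, which supports using the approximation within the neighborhood of b where the Gaussian terms are non-negligible.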

















